NSF PAR Search | NSF Public Access Repository

A Systematic Study of Popular Software Packages and AI/ML Models for Calibrating In Situ Air Quality Data: An Example with Purple Air Sensors

https://doi.org/10.3390/s25041028

Smith, Seren; Trefonides, Theodore; Srirenganathan Malarvizhi, Anusha; LaGarde, Shyra; Liu, Jiakang; Jia, Xiaoguo; Wang, Zifu; Cain, Jacob; Huang, Thomas; Pourhomayoun, Mohammad; et al (February 2025, Sensors)

Accurate air pollution monitoring is critical to understand and mitigate the impacts of air pollution on human health and ecosystems. Due to the limited number and geographical coverage of advanced, highly accurate sensors monitoring air pollutants, many low-cost and low-accuracy sensors have been deployed. Calibrating low-cost sensors is essential to fill the geographical gap in sensor coverage. We systematically examined how different machine learning (ML) models and open-source packages could help improve the accuracy of particulate matter (PM) 2.5 data collected by Purple Air sensors. Eleven ML models and five packages were examined. This systematic study found that both models and packages impacted accuracy, while the random training/testing split ratio (e.g., 80/20 vs. 70/30) had minimal impact (0.745% difference for R²). Long Short-Term Memory (LSTM) models trained in RStudio and TensorFlow excelled, with high R² scores of 0.856 and 0.857 and low Root Mean Squared Errors (RMSEs) of 4.25 µg/m³ and 4.26 µg/m³, respectively. However, LSTM models may be too slow (1.5 h) or computation-intensive for applications with fast response requirements. Tree-boosted models including XGBoost (R² of 0.7612, RMSE of 5.377 µg/m³) in RStudio and Random Forest (RF) (R² of 0.7632, RMSE of 5.366 µg/m³) in TensorFlow offered good performance with shorter training times (<1 min) and may be suitable for such applications. These findings suggest that AI/ML models, particularly LSTM models, can effectively calibrate low-cost sensors to produce precise, localized air quality data. This research is among the most comprehensive studies on AI/ML for air pollutant calibration. We also discussed limitations, applicability to other sensors, and the explanations for good model performance. This research can be adapted to enhance air quality monitoring for public health risk assessments, support broader environmental health initiatives, and inform policy decisions.
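The evaluation described in this abstract (a random training/testing split, then R² and RMSE on the held-out data) can be illustrated with a minimal sketch. It is not the study's pipeline: it assumes a plain least-squares linear calibration instead of the eleven ML models examined, and the data are synthetic. All function names and the synthetic sensor bias are hypothetical.

```python
import math
import random


def train_test_split(pairs, test_fraction, seed):
    """Randomly split (sensor, reference) pairs into train and test lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_linear(pairs):
    """Least-squares fit of reference = slope * sensor + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the inputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate(pairs, test_fraction, seed=0):
    """Fit on the training split and score the calibration on the test split."""
    train, test = train_test_split(pairs, test_fraction, seed)
    slope, intercept = fit_linear(train)
    y_true = [y for _, y in test]
    y_pred = [slope * x + intercept for x, _ in test]
    return r2(y_true, y_pred), rmse(y_true, y_pred)


# Synthetic data (an assumption, not from the study): the low-cost sensor
# over-reads the reference PM2.5 value with a linear bias plus Gaussian noise.
rng = random.Random(42)
pairs = []
for _ in range(500):
    reference = rng.uniform(2.0, 60.0)
    sensor = 1.6 * reference + 3.0 + rng.gauss(0.0, 4.0)
    pairs.append((sensor, reference))

# Compare an 80/20 split with a 70/30 split.
for fraction in (0.2, 0.3):
    score, error = evaluate(pairs, fraction)
    print(f"test fraction {fraction:.0%}: R2={score:.3f}, RMSE={error:.2f} ug/m3")
```

Running both split ratios on the same data is one simple way to check how sensitive the reported scores are to the split, which is the comparison the abstract summarizes.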
Full Text Available
The spatial dynamics of Ukraine air quality impacted by the war and pandemic

https://doi.org/10.1080/17538947.2023.2239762

Malarvizhi, Anusha Srirenganathan; Liu, Qian; Trefonides, Theodore S.; Hasheminassab, Sina; Smith, Jennifer; Huang, Thomas; Marlis, Kevin M.; Roberts, Joe T.; Wang, Zifu; Sha, Dexuan; et al (October 2023, International Journal of Digital Earth)

Full Text Available
Improving search ranking of geospatial data based on deep learning using user behavior data

https://doi.org/10.1016/j.cageo.2020.104520

Li, Yun; Jiang, Yongyao; Yang, Chaowei; Yu, Manzhu; Kamal, Lara; Armstrong, Edward M.; Huang, Thomas; Moroni, David; McGibbney, Lewis J. (September 2020, Computers & Geosciences)
Full Text Available
Enhance Visual Recognition under Adverse Conditions via Deep Networks

https://doi.org/10.1109/TIP.2019.2908802

Liu, Ding; Cheng, Bowen; Wang, Zhangyang; Zhang, Haichao; Huang, Thomas S. (January 2019, IEEE Transactions on Image Processing)

Full Text Available
A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

https://doi.org/10.3390/app9061114

Li, Yun; Jiang, Yongyao; Gu, Juan; Lu, Mingyue; Yu, Manzhu; Armstrong, Edward; Huang, Thomas; Moroni, David; McGibbney, Lewis; Frank, Greguska; et al (March 2019, Applied Sciences)

The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
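The session-reconstruction step mentioned in this abstract can be sketched in a few lines. This is only an illustration in plain Python: it uses neither Apache Spark nor Elasticsearch and does not reproduce logPartitioner. The 30-minute inactivity timeout, the record layout, and the function name are all assumptions.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the paper


def reconstruct_sessions(records, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, query) records into per-user sessions.

    A new session starts whenever two consecutive requests from the same
    user are more than `gap` seconds apart.
    """
    by_user = defaultdict(list)
    for user, timestamp, query in records:
        by_user[user].append((timestamp, query))

    sessions = []
    for user, events in by_user.items():
        events.sort()  # order each user's requests by time
        current = [events[0]]
        for previous, event in zip(events, events[1:]):
            if event[0] - previous[0] > gap:
                sessions.append((user, current))
                current = []
            current.append(event)
        sessions.append((user, current))
    return sessions


# Hypothetical log records: (user id, timestamp in seconds, search query).
records = [
    ("10.0.0.1", 0, "sea surface temperature"),
    ("10.0.0.2", 30, "ocean wind"),
    ("10.0.0.1", 120, "sst level 4"),
    ("10.0.0.1", 4000, "salinity"),
]

for user, events in reconstruct_sessions(records):
    print(user, [query for _, query in events])
```

Because users differ widely in how many requests they make, grouping by user tends to produce groups of very uneven size, which is the kind of imbalance a partitioning scheme for parallel processing has to address.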
Full Text Available
An Integrated Data Analytics Platform

https://doi.org/10.3389/fmars.2019.00354

Armstrong, Edward M.; Bourassa, Mark A.; Cram, Thomas A.; DeBellis, Maya; Elya, Jocelyn; Greguska, Frank R.; Huang, Thomas; Jacob, Joseph C.; Ji, Zaihua; Jiang, Yongyao; et al (July 2019, Frontiers in Marine Science)

Full Text Available
